A Fast, Consistent Kernel Two-Sample Test
نویسندگان
چکیده
A kernel embedding of probability distributions into reproducing kernel Hilbert spaces (RKHS) has recently been proposed, which allows the comparison of two probability measures P and Q based on the distance between their respective embeddings: for a sufficiently rich RKHS, this distance is zero if and only if P and Q coincide. In using this distance as a statistic for a test of whether two samples are from different distributions, a major difficulty arises in computing the significance threshold, since the empirical statistic has as its null distribution (where P = Q) an infinite weighted sum of χ random variables. Prior finite sample approximations to the null distribution include using bootstrap resampling, which yields a consistent estimate but is computationally costly; and fitting a parametric model with the low order moments of the test statistic, which can work well in practice but has no consistency or accuracy guarantees. The main result of the present work is a novel estimate of the null distribution, computed from the eigenspectrum of the Gram matrix on the aggregate sample from P and Q, and having lower computational cost than the bootstrap. A proof of consistency of this estimate is provided. The performance of the null distribution estimate is compared with the bootstrap and parametric approaches on an artificial example, high dimensional multivariate data, and text.
منابع مشابه
Exponentially Consistent Kernel Two-Sample Tests
Given two sets of independent samples from unknown distributions P and Q, a two-sample test decides whether to reject the null hypothesis that P = Q. Recent attention has focused on kernel two-sample tests as the test statistics are easy to compute, converge fast, and have low bias with their finite sample estimates. However, there still lacks an exact characterization on the asymptotic perform...
متن کاملTopics in kernel hypothesis testing
This thesis investigates some unaddressed problems in kernel nonparametrichypothesis testing. The contributions are grouped around three main themes:Wild Bootstrap for Degenerate Kernel Tests. A wild bootstrap method for non-parametric hypothesis tests based on kernel distribution embeddings is pro-posed. This bootstrap method is used to construct provably consistent teststh...
متن کاملB-test: A Non-parametric, Low Variance Kernel Two-sample Test
We propose a family of maximum mean discrepancy (MMD) kernel two-sample tests that have low sample complexity and are consistent. The test has a hyperparameter that allows one to control the tradeoff between sample complexity and computational time. Our family of tests, which we denote as B-tests, is both computationally and statistically efficient, combining favorable properties of previously ...
متن کاملTHE COMPARISON OF TWO METHOD NONPARAMETRIC APPROACH ON SMALL AREA ESTIMATION (CASE: APPROACH WITH KERNEL METHODS AND LOCAL POLYNOMIAL REGRESSION)
Small Area estimation is a technique used to estimate parameters of subpopulations with small sample sizes. Small area estimation is needed in obtaining information on a small area, such as sub-district or village. Generally, in some cases, small area estimation uses parametric modeling. But in fact, a lot of models have no linear relationship between the small area average and the covariat...
متن کاملFast Two-Sample Testing with Analytic Representations of Probability Measures
We propose a class of nonparametric two-sample tests with a cost linear in the sample size. Two tests are given, both based on an ensemble of distances between analytic functions representing each of the distributions. The first test uses smoothed empirical characteristic functions to represent the distributions, the second uses distribution embeddings in a reproducing kernel Hilbert space. Ana...
متن کامل